Compare Revisions - nigel.stanger/Wiki

Transcribing lectures using Whisper.md

Recording quality is critical. Poor recordings with dropouts can lead to confused transcript timings requiring significant post-editing.

Whisper can often end up creating very short text chunks, leading to a kind of “rapid fire” effect.
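A rough way to gauge how bad the “rapid fire” effect is in a given transcript is to list the cues that contain only a few words. The following is a minimal sketch, assuming a VTT file in which each cue is a timing line (containing `-->`) followed by its text and a blank line. `short_cues` is a hypothetical helper name, not part of Whisper.

```sh
# short_cues MIN_WORDS [FILE]
# Hypothetical helper: prints the timing line of every cue that has fewer
# than MIN_WORDS words. Reads FILE, or standard input if no file is given.
short_cues() {
    min=$1
    shift
    awk -v min="$min" '
        # Print the current cue if it is shorter than the threshold.
        function report() {
            if (timing != "" && words < min)
                print timing " (" words " words)"
        }
        /-->/      { report(); timing = $0; words = 0; next }  # a new cue starts
        NF == 0    { report(); timing = ""; next }             # blank line ends a cue
        timing != "" { words += NF }                           # count words of cue text
        END        { report() }
    ' "$@"
}
```

For example, `short_cues 4 lecture.vtt` would list every cue with three words or fewer (`lecture.vtt` being a placeholder file name).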

Assuming a good quality recording, the following settings seem to do a good job:

* Medium model (specifically `medium.en`). This performs better than the large model when transcribing English because `large` doesn’t have an English-specific variant.
* Enable word timestamps.
* Explicitly set the maximum number of words per line. 20 seems slightly too short, but realistically any value is probably going to cause some awkward “text breaks”.
* VTT output. EchoVideo supports other formats, but VTT is supported by everything and is easy to work with. (Needs a better VS Code extension, though.)
* Disable FP16 if running on Apple M1 or M2, as Whisper doesn’t support it there (it appears to run on the CPU, where it warns and falls back to FP32 anyway). Should be fine on everything else.
* An initial prompt may improve accuracy, e.g., “This is a postgraduate lecture about ethical issues in big data. The main topics are ethics, law, privacy, data dredging, and statistics.”
* Normalising the audio beforehand may or may not help. The `speechnorm` filter seems quite effective, e.g., `ffmpeg -i <input> -filter:a speechnorm=e=12.5 <output>`.

For example:

```sh
whisper --model medium.en --language English --output_format vtt --fp16 False --word_timestamps True --max_words_per_line 30 --initial_prompt "<prompt>" <input-file>
```

Models are downloaded to `~/.cache/whisper`.
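The normalisation step and the Whisper command above can be chained in a small shell function. This is only a sketch: `run` and `transcribe_lecture` are hypothetical helper names, and the `-norm.wav` naming scheme is an assumption of mine. Setting `DRY_RUN=1` prints the commands instead of running them, which is handy for checking the options first.

```sh
# Sketch only: `run` and `transcribe_lecture` are hypothetical helpers,
# not part of Whisper or ffmpeg.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# transcribe_lecture INPUT PROMPT
# Normalises INPUT with speechnorm, then transcribes the result to VTT.
transcribe_lecture() {
    input=$1
    prompt=$2
    normalised="${input%.*}-norm.wav"   # assumed naming scheme for the intermediate file

    run ffmpeg -i "$input" -filter:a speechnorm=e=12.5 "$normalised" || return 1
    run whisper --model medium.en --language English --output_format vtt \
        --fp16 False --word_timestamps True --max_words_per_line 30 \
        --initial_prompt "$prompt" "$normalised"
}
```

For example, `transcribe_lecture lecture01.mp4 "This is a postgraduate lecture about ethical issues in big data."` should leave the transcript in the current directory, named after the normalised file.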

**Vibe** seems to be a useful cross-platform GUI implementation. It is implemented in Rust, optimised for Apple Silicon, and supports Core ML models, so it is much faster than vanilla Whisper (written in Python and not optimised). It seems to produce very short line lengths. It claims to have a CLI, but I can’t figure out how to make it work, and the “max sentence length” setting doesn’t seem to help. It also produces not-quite-correct VTT: no `WEBVTT` header, and no blank line between entries.
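The two VTT problems can probably be patched up after the fact. The following is a minimal sketch, assuming Vibe’s output is otherwise just cue timing lines (containing `-->`) followed by their text, with no cue identifiers or `NOTE` blocks. I haven’t checked this against every variation of its output. `fix_vtt` is a hypothetical helper name.

```sh
# fix_vtt [FILE]
# Hypothetical helper: reads FILE (or standard input) and writes repaired
# VTT to standard output. Assumes cues have no identifier lines.
fix_vtt() {
    awk '
        BEGIN { print "WEBVTT" }          # always emit exactly one header
        NR == 1 && /^WEBVTT/ { next }     # skip a header that is already there
        /^[ \t]*$/ { next }               # drop existing blank lines; re-added below
        /-->/ { print "" }                # blank line before every cue timing line
        { print }
    ' "$@"
}
```

For example, `fix_vtt vibe-output.vtt > lecture.vtt` (both file names are placeholders). Running it on a file that is already valid should leave the cues unchanged.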

VS Code extension issues: